Introduce expr_fields to AccumulatorArgs to hold input argument fields
#18100
I wanted to remove schema entirely, to follow what #11725 aims for. However there are some usages that I couldn't trivially fix:
datafusion/datafusion/functions-aggregate/src/array_agg.rs
Lines 207 to 210 in 9bfa2ae
    let ordering_dtypes = ordering
        .iter()
        .map(|e| e.expr.data_type(acc_args.schema))
        .collect::<Result<Vec<_>>>()?;
datafusion/datafusion/functions-aggregate/src/nth_value.rs
Lines 158 to 161 in 9bfa2ae
    let ordering_dtypes = ordering
        .iter()
        .map(|e| e.expr.data_type(acc_args.schema))
        .collect::<Result<Vec<_>>>()?;
datafusion/datafusion/functions-aggregate/src/first_last.rs
Lines 148 to 151 in 9bfa2ae
    let ordering_dtypes = ordering
        .iter()
        .map(|e| e.expr.data_type(acc_args.schema))
        .collect::<Result<Vec<_>>>()?;
Not to mention that removing it outright would be a bigger breaking change (though we could deprecate it first, I guess).
    /// Fields corresponding to each expr (same order & length).
    pub expr_fields: &'a [FieldRef],
Main change is here
    let arg_fields = args
        .iter()
        .map(|e| e.return_field(schema.as_ref()))
        .collect::<Result<Vec<_>>>()?;
Here is how we construct the eventual expr_fields; we're essentially doing the e.return_field(schema) pattern up front for the user, instead of requiring them to do it each time they need a field (see the various other fixes in this PR, which replace those accesses with a simplified version).
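To make the "up front" pattern concrete, here is a minimal, self-contained sketch. Every type in it (Field, Schema, Expr, AccumulatorArgs) is a simplified, hypothetical stand-in and not the real DataFusion or Arrow definition; only the shape of the map/collect step and the expr_fields slice follows the snippet above.

```rust
use std::sync::Arc;

// Hypothetical stand-ins: the real Arrow/DataFusion types are much richer.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
    pub nullable: bool,
}
pub type FieldRef = Arc<Field>;

/// Stand-in schema: just an ordered list of fields.
pub struct Schema {
    pub fields: Vec<FieldRef>,
}

/// Stand-in for a physical expression.
pub enum Expr {
    /// Reference to a column by index in the input schema.
    Column(usize),
    /// Literal value; it does not need the schema to know its own type.
    Literal { data_type: String },
}

impl Expr {
    /// Resolve the output field of this expression against `schema`.
    pub fn return_field(&self, schema: &Schema) -> Result<FieldRef, String> {
        match self {
            Expr::Column(i) => schema
                .fields
                .get(*i)
                .cloned()
                .ok_or_else(|| format!("column index {i} out of bounds")),
            Expr::Literal { data_type } => Ok(Arc::new(Field {
                name: "lit".to_string(),
                data_type: data_type.clone(),
                nullable: false,
            })),
        }
    }
}

/// Compute one field per expression, once, in the same order as `exprs`.
pub fn compute_expr_fields(exprs: &[Expr], schema: &Schema) -> Result<Vec<FieldRef>, String> {
    exprs.iter().map(|e| e.return_field(schema)).collect()
}

/// Stand-in for the argument struct: fields travel alongside the expressions.
pub struct AccumulatorArgs<'a> {
    pub exprs: &'a [Expr],
    pub expr_fields: &'a [FieldRef],
}
```

In this sketch a literal-only argument list still yields a field even when the schema is empty, which is the case that made a schema-only lookup confusing.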
fyi @kosiew I tried implementing it like this and it doesn't seem to cause any regressions; thoughts on whether this fix is simpler?
Your approach is an improvement! ✅ Simpler implementation - straightforward addition of pre-computed fields 👍 👍 👍
Hi @Jefffrey,
I think you missed these:
https://github.com/apache/datafusion/blob/9bfa2ae/datafusion/physical-expr/src/aggregate.rs#L616
When AggregateFunctionExpr::with_new_expressions rewrites an aggregate to use new argument expressions, it clones the prior arg_fields. This means the expr_fields handed to the accumulator can carry metadata from the old expressions instead of the newly supplied ones, defeating the purpose of the added field (e.g., extension metadata will never update for rewritten expressions).
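A rough sketch of the stale-fields concern, using hypothetical stand-in types (this AggregateExpr is not the real AggregateFunctionExpr, and the stand-in return_field takes no schema): if the rewrite path re-derives the fields from the new expressions instead of cloning the old ones, updated metadata reaches the accumulator.

```rust
use std::sync::Arc;

// Hypothetical stand-ins, not the real DataFusion types.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub metadata: Option<String>,
}
pub type FieldRef = Arc<Field>;

/// Stand-in expression that already knows its output field.
#[derive(Debug, Clone)]
pub struct Expr {
    pub output: Field,
}

impl Expr {
    pub fn return_field(&self) -> FieldRef {
        Arc::new(self.output.clone())
    }
}

/// Stand-in aggregate expression holding arguments and their fields.
pub struct AggregateExpr {
    name: String,
    args: Vec<Expr>,
    arg_fields: Vec<FieldRef>,
}

impl AggregateExpr {
    pub fn new(name: String, args: Vec<Expr>) -> Self {
        // Fields are always derived from the arguments they describe.
        let arg_fields = args.iter().map(Expr::return_field).collect();
        Self { name, args, arg_fields }
    }

    /// Rebuild with new arguments. The fields are recomputed here rather
    /// than cloned from `self.arg_fields`, so they cannot go stale.
    pub fn with_new_expressions(&self, args: Vec<Expr>) -> Self {
        Self::new(self.name.clone(), args)
    }

    pub fn name(&self) -> &str {
        &self.name
    }

    pub fn args(&self) -> &[Expr] {
        &self.args
    }

    pub fn arg_fields(&self) -> &[FieldRef] {
        &self.arg_fields
    }
}
```

Routing both construction paths through one constructor is just one way to keep the two vectors in sync; the actual fix in the PR may look different.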
Both approx_percentile_cont_with_weight and the ordered/distinct branch of string_agg build nested AccumulatorArgs by filtering acc_args.exprs, but they reuse the original expr_fields slice unchanged. After the filtering, the positions (and lengths) of exprs and expr_fields no longer match, so downstream code will read the wrong field metadata once it looks beyond the first argument.
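One way to avoid the misalignment, sketched with a hypothetical generic helper (filter_aligned is not a DataFusion function): filter the expressions and their fields as pairs, so position and length stay matched in the nested arguments.

```rust
/// Keep only the (expr, field) pairs whose expression passes `keep`,
/// preserving order so that `exprs[i]` still corresponds to `fields[i]`.
/// Hypothetical helper for illustration only.
pub fn filter_aligned<E: Clone, F: Clone>(
    exprs: &[E],
    fields: &[F],
    keep: impl Fn(&E) -> bool,
) -> Result<(Vec<E>, Vec<F>), String> {
    // The invariant from the struct docs: same order and same length.
    if exprs.len() != fields.len() {
        return Err(format!(
            "exprs ({}) and fields ({}) differ in length",
            exprs.len(),
            fields.len()
        ));
    }
    let (kept_exprs, kept_fields): (Vec<E>, Vec<F>) = exprs
        .iter()
        .zip(fields.iter())
        .filter(|&(e, _)| keep(e))
        .map(|(e, f)| (e.clone(), f.clone()))
        .unzip();
    Ok((kept_exprs, kept_fields))
}
```

For example, dropping a weight argument from a three-argument call leaves two expressions and exactly the two fields that describe them, rather than the first two fields of the original slice.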
Thanks for picking up on this; I actually did try doing the change inside
1c87652 fixes the Still looking into |
LGTM
    pub struct ForeignAccumulatorArgs {
        pub return_field: FieldRef,
        pub schema: Schema,
        pub expr_fields: Vec<FieldRef>,
will this be a breaking FFI change?
I don't think so, since FFI_AccumulatorArgs seems to be the one marked as being stable across FFI boundaries:
datafusion/datafusion/ffi/src/udaf/accumulator_args.rs
Lines 44 to 56 in 9f23680
    /// A stable struct for sharing [`AccumulatorArgs`] across FFI boundaries.
    /// For an explanation of each field, see the corresponding field
    /// defined in [`AccumulatorArgs`].
    #[repr(C)]
    #[derive(Debug, StableAbi)]
    #[allow(non_camel_case_types)]
    pub struct FFI_AccumulatorArgs {
        return_field: WrappedSchema,
        schema: WrappedSchema,
        is_reversed: bool,
        name: RString,
        physical_expr_def: RVec<u8>,
    }
Though I am not familiar with the FFI related code.
FYI @timsaucer -- perhaps you can confirm this doesn't mess up the FFI API
Which issue does this PR close?
AccumulatorArgs.schema is empty when passing in scalar input #16997
AccumulatorArgs after transition to all udafs #11725
Rationale for this change
When reviewing #17085 I was very confused by the suggested fix, and tried to understand why AccumulatorArgs didn't have easy access to the Fields of its input expressions, whereas scalar/window functions do. Introducing this new field should make it easier for users to grab the datatype, metadata, and nullability of their input expressions for aggregate functions.
What changes are included in this PR?
Add a slice of FieldRef to AccumulatorArgs so users don't need to compute the input expression fields themselves from the schema. This addresses #16997, as it was confusing to have only the schema available when there are valid (?) cases where the schema is empty (such as literal-only input).
This fix differs from #17085 in that it doesn't special-case literal-only input; it leaves the physical schema provided to AccumulatorArgs untouched but provides a more ergonomic (and less confusing) API for users to retrieve the Fields of their input arguments.
schema entirely from AccumulatorArgs maybe we wouldn't need to worry about this, but see my comment for why that wasn't done in this PR
Are these changes tested?
Existing unit tests.
Are there any user-facing changes?
Yes, a new field on AccumulatorArgs, which is publicly exposed (along with all its fields).